Genomic approaches to the genetics of alcoholism.

When studying complex diseases such as alcoholism that develop as a result of numerous genetic and environmental factors, researchers can use the sequence data that have become available both for the human and for animal genomes. For these analyses, investigators are being aided by efforts to identify and characterize functionally relevant DNA sequences in the entire genomic DNA sequence--a process called annotation. Various bioinformatics and annotation tools can help in this enterprise. These include four primary approaches: (1) precomputed, annotated public Web sites that provide a plethora of information; (2) in-house analyses from which users can choose the appropriate analyses for their purposes; (3) Web-based annotation systems that analyze a user's DNA sequence; and (4) private resources that provide access to annotated genomic sequences at cost. In addition to careful study of the DNA sequence for clues about function, expression studies of mRNA levels using gene chips provide information about the activity levels of thousands of genes that may vary in different tissues, different animals and people, or under different environmental conditions.

Since the 1980s, researchers have attempted to identify genes that underlie various diseases. Initially, these efforts focused mainly on relatively rare genetic diseases, such as cystic fibrosis and Huntington's disease, that because of their simple inheritance patterns were likely to be caused by only one gene. Using family studies to search for certain DNA segments (i.e., markers) that occurred only in people affected by these diseases, researchers successfully identified the underlying genes. More recently, the field of genetics has entered an era in which much of the focus has shifted to the study of more complex disorders, such as cancer, diabetes, hypertension, schizophrenia, and alcoholism, that are believed to develop as a result of a combination of numerous genetic and environmental factors.
The genetic study of complex diseases poses challenges that are not typically associated with single-gene disorders. For example, complex diseases are generally more common than single-gene disorders, tend to involve multiple genes, also include significant environmental factors, and are associated with a variety of characteristics and behaviors, or phenotypes, that are not simple to describe. To separate the multiple genetic and environmental components that underlie these diseases, researchers have developed more sophisticated methodological and statistical techniques. These techniques exploit knowledge about the inheritance of chromosomes, about the reorganization of genetic material that occurs during the generation of eggs and sperm (i.e., recombination), and about the analysis of quantitative traits (characteristics, such as height or intelligence, that vary along a continuum in the population). Based on this knowledge, investigators can identify specific DNA markers that are linked to complex diseases, thereby delineating regions within the genetic material of the cell that are likely to contain genes that contribute to complex phenotypes. These regions are called quantitative trait loci (QTLs). Many different alcohol-related QTLs have been identified in recent years, both through family studies of human populations and through research in model organisms, such as mice, rats, and flies.
Concurrent with the efforts to identify QTLs in alcohol studies, a vast array of bioinformatics resources has become available as the result of projects to decipher the entire genetic information (i.e., the genome) of various organisms. After the March 2000 publication of the genome sequence of the fruit fly Drosophila melanogaster, a draft of the entire DNA sequence of the human genome was completed, and in December 2002, public access to a draft sequence of the complete mouse genome became available (Waterston et al. 2002).
A major challenge facing genome researchers is to make sense of the vast amounts of "raw" genome sequence data that are being generated. For example, only about 1.5 to 2 percent of the genome is thought to represent gene-coding regions, that is, regions that specify what protein product will be produced. Most of the DNA in between the genes is noncoding, and its functions are not fully understood. Some of those DNA sequences serve to regulate gene expression (i.e., where, when, and to what degree the gene is active). In addition, at least in higher organisms, genes typically consist of different pieces called exons, which contain the actual coding regions and which are interrupted by noncoding sequences called introns. When a gene is expressed, an RNA copy of the gene is produced that is then processed so that intron regions are removed and exon regions spliced together, forming a mature mRNA¹ that carries the information to produce a specific protein product.
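The splicing step described above can be illustrated with a minimal sketch. The function name `splice`, the toy transcript, and the exon coordinates are all invented for illustration; real splicing is guided by sequence signals in the RNA rather than by a supplied coordinate list.

```python
def splice(pre_mrna: str, exons: list[tuple[int, int]]) -> str:
    """Join exon segments (0-based, end-exclusive) in order, dropping the introns."""
    pieces = []
    last_end = 0
    for start, end in sorted(exons):
        # Reject exons that overlap, run past the transcript, or are empty.
        if start < last_end or end > len(pre_mrna) or start >= end:
            raise ValueError(f"invalid or overlapping exon: {(start, end)}")
        pieces.append(pre_mrna[start:end])
        last_end = end
    return "".join(pieces)


# Toy transcript: exons in upper case, the intron in lower case (invented data).
pre_mrna = "AUGGCC" + "guaaguuuuucag" + "UUCUAA"
exons = [(0, 6), (19, 25)]
mature = splice(pre_mrna, exons)
print(mature)
```

In this toy case the two exons are joined into a single continuous coding sequence and the lower-case intron is discarded.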
To identify genes associated with a disorder or characteristic, one must be able to distinguish genes from non-coding sequences, exons from introns, and regulatory signals from other DNA sequences. The process of identifying these various DNA sequences and signals within the vast amount of genomic DNA is called annotation. In recent years, a variety of bioinformatics tools have become available for annotating the increasing amounts of raw sequence data for various organisms. Likewise, additional projects are in progress to identify subtle genetic differences called single-nucleotide polymorphisms (SNPs) that exist among individuals in human populations and that can serve as additional DNA-based markers for analyses of the human genome.
Based on the information from the draft sequence of the human genome, researchers have estimated that the total number of human genes is approximately 30,000 to 40,000 (Claverie 2001). Alterations in the coding regions of genes themselves can account for some of the genetic differences among individuals and also can be responsible for the development of various disorders. However, individual differences as well as various genetic diseases can also result from other changes in the genome, including differences in the sequences that affect how a gene is regulated (e.g., how the information within a gene is spliced together or how the protein products of genes are assembled and modified). Annotating the genome sequence and investigating variability in the genomic sequence using the data coming out of the various genome-sequencing efforts require advanced bioinformatics tools. Many such programs and resources are now available and more are under development. Alcohol researchers can use these tools to identify potentially important genetic variations, such as gene-coding region changes or gene regulatory differences that may influence alcohol-related phenotypes.
This article provides an overview of the status of various genome projects, reviews several technologies used to determine differences in gene structure and expression, and summarizes some of the applications of these tools and technologies in the alcohol research field. Because full explanations of these highly technical analytical methods would go beyond the scope of an overview article, some of the information presented here will necessarily be abbreviated and simplified. The reader is referred to other sources, such as the numerous Web sites listed in tables 1 and 2, for more specific information on the tools discussed in this article.

¹ An mRNA is a molecule generated in the nucleus, where the DNA is located, and serves as a template for the production of the corresponding gene product (i.e., protein), which occurs outside the nucleus.

Genome Projects: Status of the Sequencing Efforts
In February 2001, researchers presented two draft versions of the human genome. One version was generated by researchers participating in the publicly funded Human Genome Project (HGP), which had been initiated in 1988. The HGP investigators first generated a rough map of the genome and then determined the exact DNA sequences of individual segments of that map. The second draft of the genome was generated by Celera, a genomics company established in 1998. Celera investigators employed a slightly different strategy in which they determined the DNA sequence of randomly selected DNA segments (a process called shotgun sequencing) and combined this information with data available from the HGP to obtain an assembled human genome sequence. Both groups published the results of their work the same week in the scientific journals Nature and Science, respectively (Lander et al. 2001; Venter et al. 2001). Interested scientists can access both data sets online, although access to the Celera database must be purchased. The availability and utility of these resources offer alcohol researchers unprecedented opportunities for identifying genes and molecular mechanisms that contribute to the development of alcoholism.
Numerous investigators have now mapped and sequenced the mouse genome (Waterston et al. 2002). Geneticists expect that the high degree of similarity between the human and mouse genomes will facilitate navigation back and forth between model organism and human, both at the DNA level and at the level of analyzing the functions of specific genes. Currently, nearly completed sequences for the mouse have been generated and assembled by both public and private (Celera) mouse genome-sequencing efforts. The data produced by the public sequencing effort are available online from the University of California at Santa Cruz, the National Center for Biotechnology Information, and Ensembl, which maintain public genome databases in the United States and United Kingdom. These resources for the mouse genome have been designed to mirror similar genomics resources for the human genome, located at the same Web sites, described below.

Bioinformatics and Annotation Tools
Now that the DNA sequences covering the entire genomes of many different organisms have become available, the development of methods to decipher the roles of the genes residing within the DNA sequences has become a challenging but important task (Stein 2001). Even after the draft versions of the human genome have been completed, the functions of the great majority of the 3 billion nucleotides making up the genome remain unknown. To meet the challenges of identifying genes and other features in genomic sequences, a variety of bioinformatics tools have been developed.
As mentioned earlier, the part of a gene that actually encodes a gene product (typically a protein) is located in one or more exons. Because of their importance to genetic disease processes, it is crucial to identify exons within genomic DNA sequences (e.g., within QTLs that have been identified through other analyses). Several tools are available for this purpose, including various gene prediction programs, tools that search for similarities (i.e., homologies) between sequences, and other resources for detecting elements related to gene expression that are conserved across species.
Most of these tools are based on the hypothesis that because many important cellular functions and processes are highly conserved among diverse organisms, certain DNA sequences also have universal functions. For example, sequences signaling the beginning of a gene and splicing signals will be very similar for all genes in an organism and for different organisms. Accordingly, computer programs can search DNA sequences for the presence of potential start and splice signals and thus predict the location of exons. The various exon prediction programs available use three main algorithmic approaches that differ slightly from one another.² The most effective means of identifying DNA sequences that are likely to represent true exons is to use multiple exon prediction programs to analyze a region of interest, because researchers can then focus on those regions that are predicted by more than one program (Fortna and Gardiner 2001). Identifying other DNA sequences with known functions that are highly conserved can provide additional evidence that sequences identified as potential exons are indeed parts of actual genes.
Alcohol researchers can use four primary approaches, discussed in the following sections, to identify genes that are present in alcohol-related QTLs:
• Precomputed annotated public Web sites can provide all of the necessary information for many applications and require minimal computer expertise.
• In-house analyses allow users to choose specific programs for identifying exons or other genomic sequence landmarks; these analyses can be useful for researchers with bioinformatics and computer expertise.
• Web-based annotation systems analyze a DNA sequence submitted by the user.
• Private resources provide access to annotated genomic sequences at a cost.

² These approaches are called neural network methods, rule-based systems, and Hidden Markov Models (HMMs).

Table 2 (excerpt)
Xpound | Gene Finder (rule-based) | Local | Software for exon trapping based on maximum likelihood methods (Kamb et al. 1995)
MZEF | Gene Finder (rule-based) | Local | Predicts putative internal protein-coding exons in genomic DNA sequences; starts with a potential exon and calculates the posterior exon probability (Zhang 1997, 1998)

Ensembl. Like the other two sites, Ensembl provides map views of the genome (see figure 3). In addition, Ensembl generates specific reports for each search that is conducted, including such information as a list of the aliases under which a given marker might appear in other databases. A search for a specific gene generates an Ensembl Gene Report that lists information about the gene's location and function, methods used to predict the gene, links to other databases, and other relevant information.

In-House Methods
For researchers who want to perform their own analyses without relying on public databases, two primary programs, or workbenches, are available to facilitate annotation and analysis. These workbenches are called Genotator and Alfresco. Overall, these programs are very similar, although Alfresco can be downloaded onto a Macintosh, PC, or UNIX computer, and Genotator must be installed on a UNIX computer. The results of various types of sequence analyses, which are described in the following sections, can be plugged into these workbenches for combined analysis.
Conducting their own analyses of genomic sequences holds several advantages for researchers. First, the public annotated sites are only updated every few months, so the annotations may not include the most recent information. Other databases, however, are updated daily and can be accessed during an in-house analysis for up-to-the-minute examinations. Second, at present the annotations found on public Web sites are generated using only selected exon prediction programs, whereas specialized workbenches allow for incorporation of many other programs. Third, the flexibility of a workbench also allows inclusion of additional types of analyses, depending on the interest of the investigator. Finally, the browsers associated with these two popular workbenches have been designed to simplify analysis of the results.
The following sections describe three classes of analyses that can be performed on genomic DNA sequences: homology searches, exon prediction, and other analyses. (For a listing of these tools and information about them, see table 2.) Results from each of these analyses can be incorporated into the Genotator or Alfresco browser for simplified viewing of the combined results (see figures 4A and 4B) and for further information.

Homology Searches. An important step in analyzing unknown DNA sequences is to search for similarities, or homologies, with already known sequences because such homologies can give an indication of the function of the unknown sequence. For example, genes with similar functions (e.g., binding to DNA to regulate gene expression) often are characterized by specific DNA sequences. Similarly, related genes of humans and other organisms, and even noncoding DNA sequences, are often highly homologous, and identification of a known mouse DNA sequence with homologies to an unknown human sequence could suggest the function of the human sequence.
To identify homologies, one must first identify and block DNA regions in which short sequences are repeated several times. These repeats, which often are very similar among organisms but are so common that they provide no substantial information, can interfere with the analysis of other homologies. One program to identify and then "mask" these repeats is called RepeatMasker. The sequences that remain after the repetitive sequences have been masked are then compared with other databases, such as the NCBI EST database³ and protein databases. The output from these searches can be processed and then incorporated into the Genotator or Alfresco browser (Harris 1997, 2000).
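The idea of masking repeats before a homology search can be sketched with a toy example. This is not how RepeatMasker itself works; the sketch merely assumes that a repeat is a short unit (1 to 6 bases) occurring at least four times in a row, and replaces such runs with "N" so that they would be ignored in a later comparison. The function name, the thresholds, and the sequence are invented for illustration.

```python
import re


def mask_tandem_repeats(seq: str, max_unit: int = 6, min_copies: int = 4) -> str:
    """Replace runs of a short repeated unit with N characters (hard masking)."""
    # ([ACGT]{1,max_unit}?) captures the shortest possible unit;
    # \1{min_copies-1,} requires that unit to recur immediately afterward.
    pattern = re.compile(r"([ACGT]{1,%d}?)\1{%d,}" % (max_unit, min_copies - 1))
    return pattern.sub(lambda m: "N" * len(m.group(0)), seq.upper())


# Invented sequence: a (CA)x6 repeat between two unique flanking segments.
sequence = "GCTGAT" + "CA" * 6 + "GGTCAT"
masked = mask_tandem_repeats(sequence)
print(masked)
```

The unique flanking segments survive unchanged, so only they would contribute to any subsequent database comparison.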

Exon Prediction. Another approach for determining whether newly identified sequences contain a gene is to identify potential coding sequences, or exons. Several exon prediction programs (see table 2) can be used to analyze genomic sequences, including:
• Xpound, one of the earliest gene-finding programs (Kamb et al. 1995)
• MZEF (Zhang 1997, 1998)
• GrailEXP, which can model more complicated gene structures, such as one gene embedded within an intron of another, and can predict exon boundaries more accurately than other programs (Xu et al. 1997)
• GeneID, which has been designed with a hierarchical structure in which several steps are performed consecutively to help identify exons and genes (Parra et al. 2000)
• Eukaryotic GeneMark.hmm version 2.2a
• GENSCAN (Burge and Karlin 1997)
• Fgenesh (Solovyev and Salamov 1997).

³ EST stands for "expressed sequence tag," which is a partial DNA sequence segment derived from randomly selected cDNAs.
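Because predictions from several programs are most informative where they agree, one simple way to combine them is to keep only the regions supported by at least two programs. The sketch below assumes each program's output has already been reduced to a list of (start, end) intervals; the program labels and coordinates are invented and do not represent output from any of the tools listed above.

```python
def merge(intervals):
    """Merge overlapping or touching intervals from a single program."""
    merged = []
    for start, end in sorted(intervals):
        if start >= end:
            continue  # skip empty or reversed intervals
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def consensus_regions(predictions, min_programs=2):
    """Return intervals (end-exclusive) predicted by at least min_programs programs."""
    events = []
    for intervals in predictions.values():
        for start, end in merge(intervals):
            events.append((start, 1))   # a prediction opens here
            events.append((end, -1))    # a prediction closes here
    events.sort()  # at equal positions, closings sort before openings
    regions, depth, region_start = [], 0, None
    for position, delta in events:
        previous = depth
        depth += delta
        if previous < min_programs <= depth:
            region_start = position
        elif depth < min_programs <= previous:
            regions.append((region_start, position))
    return regions


# Hypothetical predictions from three unnamed programs (invented coordinates).
predictions = {
    "program_a": [(100, 250), (400, 520)],
    "program_b": [(120, 260), (800, 900)],
    "program_c": [(110, 240), (410, 500)],
}
consensus = consensus_regions(predictions, min_programs=2)
print(consensus)
```

Raising `min_programs` makes the filter stricter; the region predicted by only one program (800 to 900) is dropped at any threshold above one.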

Additional Analyses. Several other tools can be used to analyze unknown DNA sequences. For example, researchers can look for "open reading frames," areas of DNA that could encode at least short proteins. One program that allows this type of analysis is called ORF Finder. To identify promoters (short stretches of DNA that regulate gene expression and are typically adjacent to the start site of a gene), researchers can use programs such as Neural Network Promoter Prediction (NNPP) (Reese and Eeckman 1995) or PromoterInspector (Scherf et al. 2000). Finally, investigators may wish to identify CpG islands, DNA regions with a characteristic composition of DNA building blocks, which are commonly located near the starting regions of genes. One program available for this purpose is the previously mentioned GrailEXP package (Xu et al. 1997).
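Two of these analyses lend themselves to a short sketch. The code below is not the method used by ORF Finder or GrailEXP. It assumes that an open reading frame runs from an ATG codon to the next in-frame stop codon on the forward strand only, and it scores CpG content with the observed/expected CpG ratio, one commonly used measure that is an assumption here rather than something the programs above are stated to use. The function names and the test sequence are invented.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_codons: int = 3) -> list[tuple[int, int]]:
    """Return forward-strand ORFs as (start, end), end-exclusive, stop codon included."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon in this frame
            elif start is not None and codon in STOP_CODONS:
                # Count codons from ATG up to (not including) the stop codon.
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)


def cpg_observed_expected(seq: str) -> float:
    """Observed CpG count divided by the count expected from C and G frequencies."""
    seq = seq.upper()
    c_count, g_count = seq.count("C"), seq.count("G")
    if c_count == 0 or g_count == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c_count * g_count)


# Invented sequence containing one short ORF in the third reading frame.
dna = "CCATGGCTTGGAAATAACC"
orfs = find_orfs(dna)
print(orfs, [dna[s:e] for s, e in orfs])
print(cpg_observed_expected("CGCGCGCG"))
```

A fuller analysis would also scan the reverse-complement strand and evaluate the CpG ratio in sliding windows rather than over a whole sequence.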

Web-Based Annotation Services
In addition to comparing genomic sequences with precomputed DNA analyses and annotations available through various Web browsers, researchers can submit their sequences via a Web site for custom analysis. This approach would be useful primarily to researchers who through in-house sequencing have obtained genomic DNA sequences from those regions of the genome that remain as gaps in the public database. Three major Web-based resources allow investigators to submit genomic sequences to a server and receive analysis results via e-mail, namely:
• The NIX application, which includes seven exon prediction programs as well as other applications and allows the user to incorporate additional annotations that could be useful for giving presentations
• The RUMMAGE server, which is more comprehensive than NIX but not as easy to use (Fortna and Gardiner 2001)
• The Gestalt system, which also allows for incorporation of additional annotations.
Some of these services use analysis programs that have not been incorporated into the pre-annotated genome browsers.
All of the human genome annotation options described so far in this article, including the pre-annotated, in-house, and Web-based systems, have been evaluated in a review by Fortna and Gardiner (2001).

Private Annotation Services
In addition to the free resources described above, a private company, Celera, provides a version of the annotated human genome at a significant cost. Celera has developed a detailed, methodical approach for identifying genes that incorporates into its database various types of information about all genes identified by the company (figure 5). Celera also has analyzed the mouse genome and can provide the data as an assembled and annotated genome database. Additional information about Celera's databases is available on the company's Web site.

Expression Studies
As mentioned earlier, alcoholism is a complex disease believed to develop as the result of a combination of environmental influences and multiple underlying genes that may predispose some people toward the disorder. It is highly likely that an increased vulnerability to alcoholism results not only from DNA changes in the protein-coding regions of those genes, but also from more subtle variations in noncoding regions that affect when, where, and in what amount the protein encoded by the gene is produced. The identification of underlying genetic differences could also provide valuable insight into the progression of alcohol tolerance, dependence, and toxic effects on the brain and could facilitate development of clinical treatments for alcoholism.
To comprehensively search for DNA changes that could contribute to alcohol-related disorders, sequences that represent gene-coding regions as well as those that influence the regulation and expression of genes need to be investigated. One hallmark of genomic analysis is its emphasis on techniques that identify and analyze large numbers of genes, rather than a few, at one time. One of these techniques is the use of high-density "gene chips" for gene expression studies. Similar to computer chips, gene chips are covered with minute amounts of DNA derived from different genes. They allow investigators to simultaneously compare the expression levels of thousands of genes between different individuals or under different environmental conditions. Gene chips can be obtained through several commercial suppliers, as well as through the many academic gene chip facilities that have arisen to permit easy and inexpensive access to standard and custom sets of genes. Broadly speaking, there are three types of chip technologies (see table 1 for a list of related Web sites):
• Affymetrix GeneChips, available in a number of prefabricated varieties, use short, synthetic pieces of specific DNA sequences that are generated directly on the chips. They are used to evaluate mRNA expression in several different organisms, including humans, mice, and rats. Researchers can also order custom chips designed to examine specific genes of interest.
• LifeArray™ chips, designed by Incyte Genomics, contain cDNA sequences corresponding to different human, mouse, or rat genes that are placed on small, specially coated glass slides.
• "Exon" and "tiling" arrays, generated by Rosetta Inpharmatics, employ a unique exon-specific approach that enables investigators to detect gene expression differences, including splicing variants, under a variety of different tissue or cell conditions at, potentially, the level of the entire genome (see Shoemaker et al. 2001).
As mentioned previously, all of these chips allow geneticists to investigate regulatory differences in the expression of thousands of genes at a time. Alcohol researchers are beginning to apply these methods in the study of both human alcoholism and alcohol-related phenotypes in mouse models.
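A comparison of expression levels between two samples can be sketched as a log2 fold-change calculation. The example below is a deliberately simplified illustration: the gene names, intensity values, pseudocount, and two-fold cutoff are all assumptions, and actual chip studies rely on replicate measurements, normalization, and statistical testing rather than a single ratio.

```python
import math


def log2_fold_changes(sample_a, sample_b, pseudocount=1.0):
    """Log2 ratio of sample_b to sample_a for genes measured in both samples."""
    shared = sample_a.keys() & sample_b.keys()
    return {
        gene: math.log2((sample_b[gene] + pseudocount) / (sample_a[gene] + pseudocount))
        for gene in shared
    }


def differentially_expressed(sample_a, sample_b, min_abs_log2=1.0):
    """Genes whose expression differs by at least 2**min_abs_log2 in either direction."""
    changes = log2_fold_changes(sample_a, sample_b)
    return sorted(gene for gene, value in changes.items() if abs(value) >= min_abs_log2)


# Invented signal intensities for two hypothetical samples.
strain_1 = {"gene1": 99.0, "gene2": 49.0, "gene3": 199.0, "gene4": 10.0}
strain_2 = {"gene1": 399.0, "gene2": 51.0, "gene3": 49.0}
changed = differentially_expressed(strain_1, strain_2)
print(changed)
```

Genes measured in only one sample (here `gene4`) are left out of the comparison, and the pseudocount keeps the ratio defined when a signal is zero.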

Applications to Alcohol Research
Several groups of alcohol researchers have begun exploring these state-of-the-art technologies to look for variations in gene expression related to the development of alcoholism or alcohol-related behaviors. Xu and colleagues (2001) used gene chips containing mouse DNA to identify 41 genes whose expression differed in the brains of 2 mouse strains called Inbred Short-Sleep (ISS) and Inbred Long-Sleep (ILS) mice, which have been genetically selected for differences in initial sensitivity to alcohol (McClearn and Kakihana 1981). Daniels and Buck (2002) have found evidence for differential regulation of gene expression during alcohol withdrawal in the hippocampal brain region of two mouse strains called C57BL/6J and DBA/2J. In this study, the expression of more than 100 genes, most of which fell into 6 major functional categories, differed substantially between the 2 strains.
These examples illustrate the importance of research involving animal models of alcohol-related phenotypes. Using animal models, researchers can carefully control and examine different conditions under which gene expression may vary, such as the extent of previous alcohol exposure and the developmental stage of the animals. Use of animal models also allows investigators to focus on specific brain regions, which may help identify the specific genes, proteins, and molecular mechanisms that underlie alcohol's actions in the brain. Similar analyses can also be performed in cultured cells rather than in intact animals or tissues (Thibault et al. 2000).
Analysis of gene regulation in humans rather than in animal models is more challenging for numerous reasons. People who abuse alcohol vary substantially more in their genetic makeup than do inbred mouse strains. Furthermore, one cannot conduct the same experiments on humans as on animals simply because genetic material from the brains of humans is available only after their natural death, whereas animal samples can be obtained at various developmental stages. When differences in gene expression between alcohol-abusing and non-alcohol-abusing people are measured after death, however, it is difficult to determine whether they result from underlying genetic differences between the two groups or from the alcohol abuse. Despite their limitations, postmortem analyses may yield important results. For example, a recent study of gene expression in postmortem brain samples from alcoholics and matched control subjects revealed significant differences in the expression of certain genes that may be related to morphological differences in the brains of alcoholics and nonalcoholics, such as shrinkage of white matter (Lewohl et al. 2001).

Figure 5. Example of a search result using the human genome browser developed by Celera Genomics. The map includes many annotations similar to those in the public domain, as well as additional results generated by Celera from analyses of gene predictions and information about novel genes that may be related to known genes.
The recent delineation of substantially complete sequences of both the human and mouse genomes has provided an important stimulus to genetic research in general as well as alcohol research in particular. Until now, the discovery of important disease-related gene sequence differences required that DNA sequencing be carried out by the investigating laboratory to find the relevant sequence change. The availability of extensive human and mouse genome sequence databases, however, may eliminate this requirement so that many such discoveries can now be made by computer, or "in silico." For example, in the alcohol field, many mouse QTLs associated with certain alcohol effects were identified using the C57BL/6J and DBA/2J mice. Now, however, the draft genome sequences for these two strains are available, making it possible for investigators to select potential QTLs from the genome sequence databases of the two strains and, through computer analyses, search directly for any sequence differences. Although the genome sequence databases for these two strains still contain gaps, they cover a substantial part of the genome, allowing researchers to rapidly compare many QTL genes and identify differences. This approach has wide potential applicability and, in principle, could be used for QTL analysis in all mouse strains for which relatively complete genome sequences are available (Marshall et al. 2002).
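The in silico comparison described above can be illustrated with a minimal sketch that lists the positions at which two strain sequences differ. It assumes the two sequences have already been aligned to equal length and that unsequenced gaps are written as "N"; both sequences are invented, and a real comparison would first require aligning the corresponding database entries.

```python
def sequence_differences(seq_a: str, seq_b: str) -> list[tuple[int, str, str]]:
    """Return (position, base_a, base_b) wherever two aligned sequences differ."""
    seq_a, seq_b = seq_a.upper(), seq_b.upper()
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (i, a, b)
        for i, (a, b) in enumerate(zip(seq_a, seq_b))
        # Positions with an unsequenced base (N) in either strain are skipped.
        if a != b and a != "N" and b != "N"
    ]


# Invented aligned fragments standing in for the same region in two strains.
strain_one = "ACGTTGCANNGT"
strain_two = "ACGATGCATTGC"
differences = sequence_differences(strain_one, strain_two)
print(differences)
```

Skipping positions marked "N" reflects the point made above that the draft databases still contain gaps, so a missing base is not counted as a difference.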

Conclusion
Ultimately, the identification of the genetic components and mechanisms that underlie the effects of alcohol on the brain must involve a convergence of efforts from investigators representing many fields, including behavior genetics, neurobiology, and molecular biology. Particularly because of the extensive new genomics resources that have been generated over the past decade, alcohol researchers now have unprecedented knowledge about the human and mouse genomes. The mining of such genome-based information offers the promise of uncovering many of the key genes and pathways that underlie complex genetic diseases, including alcoholism and alcohol abuse.